Judicial Decisions on Patent Validity

Cases at the German Federal Patent Court, 2000 - 2020

Judicial Decision Making in Patent Litigation

In the legal and political science literatures, there has been an ongoing debate, spanning more than 100 years, about what drives judicial decision making (Cross 2003). The legalist (or legal rationalist or legal formalist) tradition sees the judge as a vehicle executing the law. The realist (or behaviouralist) tradition sees the law as incomplete and as leaving decision making degrees of freedom to judges. Judges, then, respond to attitudinal preferences or strategic incentives (Segal 2001, Epstein & Knight 2013). This debate is strongly biased towards the USA and especially its Supreme Court, where politically controversial cases and partisan decision making are frequent.

While the realist approach has demonstrated its relevance in the setting of politicized US courts in many empirical studies (e.g. Segal & Cover 1989), patent law differs from other fields of law due to its highly technical nature. To come to a satisfactory conclusion in a patent dispute, in-depth technical knowledge relating to the case at hand is usually required in addition to legal knowledge. These differences also change the appropriateness of legalist and realist accounts of judicial decision making. At least three factors favour a legalist model of decision making in patent litigation: (1) a complete description of the facts of the case exists in the form of the patent document, (2) there are standard legal techniques (such as the fictitious ‘person skilled in the art’ or criteria for validity) to derive a conclusion from the facts of the case and the law, and (3) the existence of political incentives for the judiciary in the outcomes of most patent disputes is less obvious. Given such technical, legal, and procedural constraints, a reasonable expectation is that judicial degrees of freedom are minimal:

\(H_{l}\): Under a legalist model of judicial decision making, there should be no systematic between-judge variation across comparable cases.

Despite ostensibly transparent criteria and comprehensive technical information, the reality of judicial decision making in patent litigation is, however, often different: First, there is often deliberate uncertainty in the technical specification contained in the patent document to broaden its scope of protection (Mullally 2009), making the decision less formulaic. Second, specific legal constructions are often applied differently in different legal contexts or schools of thought, as was the case with the ‘doctrine of equivalents’ in the famous Epilady controversy (Hatter 1994). And third, patent judges have been found to diverge in their substantive interpretations of what constitutes a patent (Lazega et al. 2017). Given these factors, an alternative hypothesis suggests itself:

\(H_{r}\): Under a realist model of judicial decision making, there are judge degrees of freedom, allowing for between-judge variation in decision making even for comparable cases.

The goal of this paper is to empirically test \(H_l\) against \(H_r\) within a single court, the German Federal Patent Court. Given this goal, a core task is to define what constitutes ‘comparable cases’. We investigate three factors to control for: First, the time point of the decision, as the observation period stretches over a period of relative turmoil in the global patenting system, which might have a systematic impact on case outcomes. Second, technology classification, as different technologies might be subject to different industry dynamics, which in turn might tip the scales systematically in the plaintiff’s or defendant’s direction. And third, senate affiliation, as specific senates might have developed their own explicit or implicit rules.

Data on Validity Cases at the Bundespatentgericht

Code
using Model
using Dictionaries, SplitApplyCombine
using DataFramesMeta, Dates

The data explored here were obtained from the BPatG online repository, which contains all of the court’s decisions since 2000. Initially, all 30,000 documents were downloaded. Among these, nullity decisions were selected and filtered down to verdict documents only (excluding, e.g., additional decisions on compensation). This left around 1,200 documents, from which we extracted key metadata.

decisions = Model.loaddata("../data/processed/json_augmented");
# Get a look at the data for an example decision
first(decisions)
Ruling 1 Ni 4/99 (EU) on EP0389008
Date of decision: 26 September, 2000
Decided by: 1. Senate (Hacker, Vogel, Henkel, Maier, van_Raden)
Outcome: partially annulled
# count of observations in the sample
length(decisions)
1227

An outcome is coded as 0 if the claim is dismissed and as 1 if the patent is fully or partially annulled:

# count of observations by outcome label
groupcount(outcome, decisions) |> sort
3-element Dictionary{Outcome, Int64}
           Outcome(1, "annulled") │ 236
    Outcome(0, "claim dismissed") │ 295
 Outcome(1, "partially annulled") │ 696
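The binary coding can be sketched as a tiny helper (a hypothetical illustration; in the actual pipeline the coding is handled by the unshown `Model` module):

```julia
# Map an outcome label to the binary coding used below:
# 0 = claim dismissed, 1 = patent fully or partially annulled.
# Hypothetical helper; the real coding lives in the Model module.
binarize(label::AbstractString) = label == "claim dismissed" ? 0 : 1

binarize.(["claim dismissed", "partially annulled", "annulled"])  # -> [0, 1, 1]
```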

Exploratory Results

Code
# Helper function computing the number of decisions and the
# share of each outcome label for a collection of decisions
function summarize_outcome(ds)
    n = length(ds)
    countf(l) = count(d -> label(outcome(d)) == l, ds) / n
    outcome_labels = ("claim dismissed", "partially annulled", "annulled")
    dismissed, partial, annulled = countf.(outcome_labels)
    (;n, dismissed, partial, annulled)
end;

# yarrrr... (type piracy: tabulate a Dictionary as a DataFrame with a :keys column)
function DataFrames.DataFrame(d::Dictionary)
    ks = collect(keys(d))
    vs = collect(values(d))
    df = DataFrame(vs)
    insertcols!(df, 1, :keys => ks)
end

Is there variation over time?

df = map(summarize_outcome, group(Dates.year ∘ date, decisions)) |> DataFrame;
Code
using CairoMakie, AlgebraOfGraphics

function plot_outcome_year(df)
    outcomes = [:dismissed, :partial, :annulled]
    dflong = DataFrames.stack(df, outcomes)
    dflong.value = dflong.value .* dflong.n

    plt = data(dflong) * mapping(
            :keys => "", :value => ""; 
            stack=:variable, 
            color=:variable => ""
          ) * visual(BarPlot)
    
    draw(plt; legend=(;position=:bottom))
end

plot_outcome_year(df)

Gaussian process model
using ApproximateGPs
using Distributions
using LinearAlgebra
using LogExpFunctions: logistic, softplus, invsoftplus
using Zygote
using Optim

function build_latent_gp(θ)
    variance, lengthscale = softplus.(θ)
    kernel = variance * with_lengthscale(SqExponentialKernel(), lengthscale)
    LatentGP(GP(kernel), BernoulliLikelihood(), 1e-8)
end

function optimize_hyperparams(make_f, x, y; θ_init=invsoftplus.([1, 0.05]))
    objective = build_laplace_objective(make_f, x, y)
    grad(θ) = only(Zygote.gradient(objective, θ))
    result = Optim.optimize(objective, grad, θ_init, LBFGS(); inplace=false)
    objective, result
end

function posterior_optimize(x, y)
    objective, optimized = optimize_hyperparams(build_latent_gp, x, y)
    lf_opt = build_latent_gp(optimized.minimizer)
    posterior(LaplaceApproximation(;f_init=objective.cache.f), lf_opt(x), y)
end

function simulate()
    X = range(0, 23.5; length=48)
    f(x) = 3 * sin(10 + 0.6x) + sin(0.1x) - 1
    ps = logistic.(f.(X)) 
    Y = [rand(Bernoulli(p)) for p in ps]
    X, Y, f
end

function plot_data!(ax, x, y; true_f = nothing)
    scatter!(ax, x, y)
    # overlay the true function on the probability scale, when provided
    !isnothing(true_f) && lines!(ax, x, logistic.(true_f.(x)))
end
function plot_data(x, y; true_f=nothing)
    fig = Figure()
    ax = Axis(fig[1,1])
    plot_data!(ax, x, y; true_f)
    fig
end

function plot_posterior!(ax, x, y, xgrid, fpost; true_f=nothing)
    fx = fpost(xgrid, 1e-8)
    fsamples = rand(fx, 100)
    # individual posterior samples on the probability scale
    foreach(eachcol(fsamples)) do fs
        lines!(ax, xgrid, logistic.(fs); color=:grey80)
    end
    # posterior mean of the latent function, pushed through the logistic link
    lines!(ax, xgrid, map(logistic ∘ mean, eachrow(fsamples)); color=:red, linewidth=2)
    plot_data!(ax, x, y; true_f)
end
function plot_posterior(x, y, xgrid, fpost; true_f=nothing)
    fig = Figure()
    ax = Axis(fig[1,1])
    plot_posterior!(ax,x, y, xgrid, fpost; true_f)
    fig
end;
plotting code
x = [Dates.days(date(d) - minimum(date, decisions)) for d in decisions] ./ 365
y = (id ∘ outcome).(decisions)

#post = posterior_optimize(x, y)
make_f = build_latent_gp([1.0, 5.0])
post = posterior(LaplaceApproximation(), make_f(x), y)
plot_posterior(x, y, range(0, maximum(x), 100), post)

20×5 DataFrame
 Row │ keys   n      dismissed  partial    annulled
     │ Int64  Int64  Float64    Float64    Float64
─────┼─────────────────────────────────────────────
   1 │ 2000      66  0.272727   0.530303   0.19697
   2 │ 2001      68  0.308824   0.544118   0.147059
   3 │ 2002      68  0.205882   0.588235   0.205882
   4 │ 2003      69  0.318841   0.478261   0.202899
   5 │ 2004      69  0.333333   0.521739   0.144928
   6 │ 2005      67  0.253731   0.597015   0.149254
   7 │ 2007      41  0.341463   0.609756   0.0487805
   8 │ 2008      71  0.253521   0.577465   0.169014
   9 │ 2009      69  0.304348   0.565217   0.130435
  10 │ 2010      69  0.275362   0.594203   0.130435
  11 │ 2011      84  0.22619    0.583333   0.190476
  12 │ 2012      76  0.223684   0.644737   0.131579
  13 │ 2013      53  0.207547   0.584906   0.207547
  14 │ 2014      58  0.0517241  0.741379   0.206897
  15 │ 2015      51  0.156863   0.529412   0.313725
  16 │ 2016      45  0.222222   0.4        0.377778
  17 │ 2017      47  0.170213   0.595745   0.234043
  18 │ 2018      52  0.211538   0.576923   0.211538
  19 │ 2019      55  0.236364   0.490909   0.272727
  20 │ 2020      49  0.163265   0.55102    0.285714

Looking at outcomes by year, the most striking feature is the missing data for 2006 and, partially, for 2007, which is due to judge names not being recorded in the decision documents for that period. Furthermore, there seem to be fewer cases per year since about 2013. (TODO: Is this in accordance with reporting?) In terms of variation over time, there are slightly fewer dismissals in recent years.
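The impression of fewer dismissals in recent years can be quantified with a back-of-the-envelope comparison of the two halves of the observation period (a sketch, not part of the original pipeline; dismissal counts are reconstructed from the table above as n times the dismissal share, rounded):

```julia
# Dismissal counts and case totals per year, reconstructed from the table
# above (2006 is missing entirely; 2007 is only partially covered).
dismissed_00_10 = [18, 21, 14, 22, 23, 17, 14, 18, 21, 19]  # 2000-2005, 2007-2010
n_00_10         = [66, 68, 68, 69, 69, 67, 41, 71, 69, 69]
dismissed_11_20 = [19, 17, 11, 3, 8, 10, 8, 11, 13, 8]      # 2011-2020
n_11_20         = [84, 76, 53, 58, 51, 45, 47, 52, 55, 49]

rate_early = sum(dismissed_00_10) / sum(n_00_10)  # ≈ 0.28
rate_late  = sum(dismissed_11_20) / sum(n_11_20)  # ≈ 0.19
```

The drop of roughly nine percentage points is consistent with the visual impression, though it says nothing about the mechanism behind it.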

Is there variation between senates?

df = map(summarize_outcome, group(label ∘ senate, decisions)) |> DataFrame;
Code
function plot_outcome_senate(df)
    outcomes = [:dismissed, :partial, :annulled]
    dflong = DataFrames.stack(df, outcomes)
    dflong.value = dflong.value .* dflong.n

    plt = data(dflong) * mapping(
            :keys => "", :value => ""; 
            stack=:variable, 
            color=:variable => ""
          ) * visual(BarPlot)
    
    draw(plt; legend=(;position=:bottom), axis=(;xticklabelrotation=1))
end

plot_outcome_senate(df)

8×5 DataFrame
 Row │ keys       n      dismissed  partial   annulled
     │ String     Int64  Float64    Float64   Float64
─────┼────────────────────────────────────────────────
   1 │ 1. Senate    169  0.230769   0.56213   0.207101
   2 │ 2. Senate    260  0.242308   0.561538  0.196154
   3 │ 3. Senate    264  0.212121   0.602273  0.185606
   4 │ 4. Senate    297  0.282828   0.622896  0.0942761
   5 │ 0. Senate     38  0.263158   0.421053  0.315789
   6 │ 5. Senate    119  0.235294   0.504202  0.260504
   7 │ 6. Senate     56  0.160714   0.392857  0.446429
   8 │ 7. Senate     24  0.25       0.541667  0.208333

There is some slight variation in case outcomes across senates, with about a 7 percentage point difference in dismissal rates between, e.g., senates 3 and 4.
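As a rough plausibility check (not part of the original analysis, and ignoring dependence between cases heard by overlapping judges), a two-proportion z-statistic for the dismissal rates of senates 3 and 4 can be computed directly from the table:

```julia
# Two-proportion z-statistic for dismissal rates, senate 3 vs. senate 4.
# Counts are taken from the table above; cases are treated as independent,
# which is optimistic given overlapping panel membership.
n3, p3 = 264, 0.212121    # 3. Senate
n4, p4 = 297, 0.282828    # 4. Senate

p_pool = (p3 * n3 + p4 * n4) / (n3 + n4)          # pooled dismissal rate
se = sqrt(p_pool * (1 - p_pool) * (1/n3 + 1/n4))  # standard error under H0
z = (p4 - p3) / se                                # ≈ 1.9
```

With z just below 2, the senate difference is suggestive but not decisive on its own.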

Is there variation between judges?

Code
Model.label(s::String) = s

function flatten_and_summarize(decisions, by)
    df = DataFrame(group=by.(decisions), outcome=label.(outcome.(decisions)))
    df = @chain df begin
        DataFrames.flatten(:group)
        groupby([:group, :outcome])
        combine(nrow => :count)
        groupby(:group)
        @transform(:n = sum(:count))
        @rtransform(:share = :count / :n)
        unstack([:group, :n], :outcome, :share)
        @rtransform(:group = label(:group))
        sort!(:n; rev=true)
    end
end;
df = flatten_and_summarize(decisions, judges);
Code
function plot_outcome_judge(df)
    outcomes = ["claim dismissed" => "dismissed", 
                "partially annulled" => "partial", 
                "annulled" => "annulled"]

    dftop = first(rename(df, outcomes), 15)
    dflong = DataFrames.stack(dftop, last.(outcomes))
    dflong.counts = dflong.value .* dflong.n
    
    judgeorder = sort!(dftop, :dismissed).group

    plt = data(dflong) * mapping(
            :group => sorter(judgeorder) => "", :counts => ""; 
            stack=:variable, 
            color=:variable => ""
          ) * visual(BarPlot)
    
    draw(plt; legend=(;position=:bottom), axis=(;xticklabelrotation=1))
end

plot_outcome_judge(df)

200×5 DataFrame (175 rows omitted)
 Row │ group         n      partially annulled  claim dismissed  annulled
     │ String        Int64  Float64?            Float64?         Float64?
─────┼────────────────────────────────────────────────────────────────────
   1 │ Voit            171  0.561404            0.315789         0.122807
   2 │ Gutermuth       162  0.512346            0.290123         0.197531
   3 │ Schuster        145  0.503448            0.317241         0.17931
   4 │ Sredl           144  0.597222            0.173611         0.229167
   5 │ Engels          123  0.682927            0.211382         0.105691
   6 │ Schramm         109  0.678899            0.201835         0.119266
   7 │ Müller          107  0.626168            0.242991         0.130841
   8 │ Rauch           105  0.533333            0.333333         0.133333
   9 │ Martens         101  0.524752            0.247525         0.227723
  10 │ Brandt          100  0.64                0.17             0.19
  11 │ Friehe           93  0.505376            0.225806         0.268817
  12 │ Huber            86  0.430233            0.418605         0.151163
  13 │ Guth             80  0.6625              0.1625           0.175
  ⋮  │      ⋮          ⋮            ⋮                  ⋮              ⋮
 189 │ Pagenberg         1  1.0                 missing          missing
 190 │ Kraft             1  missing             1.0              missing
 191 │ Sekretaruk        1  1.0                 missing          missing
 192 │ Fink              1  1.0                 missing          missing
 193 │ Albrecht          1  1.0                 missing          missing
 194 │ Staudenmaier      1  1.0                 missing          missing
 195 │ S_Schmidt         1  missing             missing          1.0
 196 │ Kruppa            1  1.0                 missing          missing
 197 │ Seyfarth          1  missing             1.0              missing
 198 │ Peters            1  1.0                 missing          missing
 199 │ Philipps          1  1.0                 missing          missing
 200 │ Maierbacher       1  1.0                 missing          missing

Looking at some of the most active judges, there are relatively large differences in dismissal rates. Consider as extreme cases, e.g., Guth, with only 16% dismissals out of 80 cases, vs. Huber, with about 42% dismissals out of 86 cases.
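How surprising is Guth's rate under a court-wide dismissal probability? A quick binomial tail computation (a sketch: it treats cases as independent, which overstates the evidence, and uses the overall rate of 295/1227 as the common dismissal probability):

```julia
# P(X <= 13) for X ~ Binomial(80, p0), where 13 = Guth's dismissed cases
# (0.1625 * 80) and p0 ≈ 0.24 is the court-wide dismissal rate.
p0 = 295 / 1227
binompdf(n, p, k) = binomial(n, k) * p^k * (1 - p)^(n - k)
tail = sum(binompdf(80, p0, k) for k in 0:13)  # ≈ 0.06
```

A one-sided tail of about 6% would be noteworthy for a single pre-selected judge, but since this extreme was picked from some 200 judges, such values are expected by chance alone; the mixed membership model below handles this more carefully.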

Is there variation between technologies?

df = flatten_and_summarize(decisions, class ∘ patent);
Code
function plot_outcome_technology(df)
    outcomes = ["claim dismissed" => "dismissed", 
                "partially annulled" => "partial", 
                "annulled" => "annulled"]

    # exclude the catch-all Y10 class, then keep the 15 largest classes
    dftop = first(rename(@rsubset(df, :group != "Y10"), outcomes), 15)
    dflong = DataFrames.stack(dftop, last.(outcomes))
    dflong.counts = dflong.value .* dflong.n
    
    order = sort!(dftop, :dismissed).group

    plt = data(dflong) * mapping(
            :group => sorter(order) => "", :counts => ""; 
            stack=:variable, 
            color=:variable => ""
          ) * visual(BarPlot)
    
    draw(plt; legend=(;position=:bottom), axis=(;xticklabelrotation=1))
end

plot_outcome_technology(df)

116×5 DataFrame (91 rows omitted)
 Row │ group   n      partially annulled  claim dismissed  annulled
     │ String  Int64  Float64?            Float64?         Float64?
─────┼──────────────────────────────────────────────────────────────
   1 │ Y10       183  0.535519            0.240437         0.224044
   2 │ A61       132  0.719697            0.159091         0.121212
   3 │ H04       116  0.568966            0.112069         0.318966
   4 │ G01        82  0.658537            0.182927         0.158537
   5 │ H01        73  0.520548            0.328767         0.150685
   6 │ B65        69  0.565217            0.202899         0.231884
   7 │ B60        66  0.575758            0.212121         0.212121
   8 │ B29        62  0.435484            0.370968         0.193548
   9 │ F16        61  0.508197            0.245902         0.245902
  10 │ Y02        59  0.542373            0.186441         0.271186
  11 │ G06        56  0.75                0.0714286        0.178571
  12 │ B01        46  0.543478            0.195652         0.26087
  13 │ E04        42  0.547619            0.309524         0.142857
  ⋮  │   ⋮       ⋮            ⋮                  ⋮              ⋮
 105 │ D05         1  missing             1.0              missing
 106 │ C10         1  1.0                 missing          missing
 107 │ B43         1  missing             1.0              missing
 108 │ F27         1  1.0                 missing          missing
 109 │ C11         1  missing             missing          1.0
 110 │ A46         1  missing             1.0              missing
 111 │ A44         1  missing             1.0              missing
 112 │ B04         1  1.0                 missing          missing
 113 │ C40         1  1.0                 missing          missing
 114 │ B81         1  1.0                 missing          missing
 115 │ F26         1  1.0                 missing          missing
 116 │ A42         1  1.0                 missing          missing

There is some evidence for variation in outcome across different technologies; compare, e.g., G06 (computing, calculating, counting), with a 7% dismissal rate, and B29 (working of plastics), with a 37% dismissal rate.
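The same back-of-the-envelope z-statistic as for the senates (again a sketch assuming independent cases, not part of the original analysis) can be applied to the G06 vs. B29 comparison:

```julia
# Two-proportion z-statistic for dismissal rates, G06 vs. B29.
# Dismissal counts reconstructed from the table: 4/56 (G06), 23/62 (B29).
n1, k1 = 56, 4     # G06: computing, calculating, counting
n2, k2 = 62, 23    # B29: working of plastics
p1, p2 = k1 / n1, k2 / n2

p_pool = (k1 + k2) / (n1 + n2)
se = sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
z = (p2 - p1) / se  # ≈ 3.9
```

A z close to 4 looks strong, but this is the most extreme pair among 100+ classes, so the multiple-comparisons problem applies; the model-based analysis below accounts for such variation jointly.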

Is there variation across panel compositions?

df = sort!(map(summarize_outcome, group(decisions) do ds
    js = label.(first(judges(ds), 3))
    js = join(Set(js), ", ")
end); rev=true) |> DataFrame;
Code
function plot_outcome_composition(df)
    outcomes = [:dismissed, :partial, :annulled]
    dflong = first(df, 15)
    dflong = DataFrames.stack(dflong, outcomes)
    dflong.value = dflong.value .* dflong.n

    plt = data(dflong) * mapping(
            :keys => "", :value => ""; 
            stack=:variable, 
            color=:variable => ""
          ) * visual(BarPlot)
    
    draw(plt; 
         legend=(;position=:bottom), 
         axis=(;xticklabelrotation=1.4),
         figure=(;resolution=(700, 700)))
end

plot_outcome_composition(df)

523×5 DataFrame (498 rows omitted)
 Row │ keys                                n      dismissed  partial   annulled
     │ String                              Int64  Float64    Float64   Float64
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Frowein, Landfermann, Barton           29  0.37931    0.448276  0.172414
   2 │ Voit, Martens, Albertshofer            18  0.277778   0.5       0.222222
   3 │ Gutermuth, Meinhardt, Henkel           16  0.375      0.5625    0.0625
   4 │ Merzbach, Fritze, Sredl                16  0.3125     0.4375    0.25
   5 │ Guth, Schramm, Proksch-Ledig           15  0.266667   0.666667  0.0666667
   6 │ Pösentrup, Hellebrand, Köhn            15  0.133333   0.533333  0.333333
   7 │ Schermer, Engels, Proksch-Ledig        14  0.285714   0.5       0.214286
   8 │ Egerer, Guth, Schramm                  12  0.0        0.666667  0.333333
   9 │ Schwendy, Kalkoff, Obermayer           11  0.363636   0.636364  0.0
  10 │ Schwendy, Klosterhuber, Haaß           11  0.272727   0.636364  0.0909091
  11 │ Merzbach, Sredl, Lokys                 11  0.272727   0.545455  0.181818
  12 │ Voit, Winkler, Huber                   10  0.8        0.2       0.0
  13 │ Brandt, Hellebrand, Wagner             10  0.3        0.5       0.2
  ⋮  │                 ⋮                      ⋮       ⋮          ⋮         ⋮
 512 │ Bork, Schwarz, Kopacek                  1  0.0        0.0       1.0
 513 │ Klante, Musiol, Schwarz                 1  0.0        0.0       1.0
 514 │ Voit, Martens, Müller                   1  0.0        0.0       1.0
 515 │ Scholz, Martens, Bayer                  1  0.0        0.0       1.0
 516 │ Merzbach, Gottstein, Martens            1  0.0        0.0       1.0
 517 │ Geier, Baumgart, Sandkämper             1  0.0        0.0       1.0
 518 │ Forkel, Hartlieb, Thum-Rung             1  0.0        0.0       1.0
 519 │ Fritze, Wiegele, Hermann                1  0.0        0.0       1.0
 520 │ Hartlieb, Friedrich, Grote-Bittner      1  0.0        0.0       1.0
 521 │ Zimmerer, Werner, Friehe                1  0.0        0.0       1.0
 522 │ Rauch, Thum-Rung, Schnurr               1  0.0        0.0       1.0
 523 │ Rauch, Fritze, Schnurr                  1  0.0        0.0       1.0

In general, this is hard to answer with the data at hand because we don’t observe individual judges’ votes, only the panel’s collective outcome. The best we can do is look at distinct combinations of judges. When picking only the first three judges for each case and looking at all cases handled by this trio, some differences in decision outcomes become visible. However, case numbers are generally too low for conclusive results.

Does a judge’s decision making change over time?

Plots
using Dates

function plot_judges(decisions, js)
    fig = Figure(resolution=(700, 1000))
    lay = CartesianIndices((5,2))

    for (i, j) in enumerate(js)
        ds = filter(decisions) do d
            label(first(judges(d))) == j
        end

        x = [Dates.days(date(d) - minimum(date, ds)) for d in ds] ./ 365
        y = (id ∘ outcome).(ds)

        #post = posterior_optimize(x, y)
        make_f = build_latent_gp([1.0, 4.0])
        post = posterior(LaplaceApproximation(), make_f(x), y)
        
        ax = Axis(fig[Tuple(lay[i])...]; title = j)
        plot_posterior!(ax, x, y, range(0, maximum(x), 100), post)
    end

    fig
end

js = [
        "Voit", "Schuster", 
        "Gutermuth", "Schramm", 
        "Engels", "Sredl",
        "Meinhardt", "Hellebrand", 
        "Schwendy", "Rauch"
     ]

plot_judges(decisions, js)

While for some judges there seems to be a tendency towards an increased nullification rate over their activity period (e.g., Gutermuth, Engels, Meinhardt), this is not a general trend and is statistically quite uncertain.

Is there still variation between judges when controlling for the other factors?

Code
using Arrow

model = MixedMembershipModel(decisions)

post_stored = "../data/derivative/inference/posterior_mixed_membership.arrow"
samples = if isfile(post_stored)
    Arrow.Table(post_stored) |> DataFrame
else
    post = Model.sample(model, 1000, 4)
    ess, rhat, treestats = Model.checkconvergence(post)
    println(ess); println(rhat); println(treestats)
    Model._post(post)
end

function plot_mixed_membership_model(model, samples, decisions)
    empirical_mean = mean(id  outcome, decisions)

    fig = Figure(resolution=(700, 1000))

    ax_pred = Axis(fig[1,1:3]; title="Probability of (partial) annulment by case")
    ax_time = Axis(fig[2,1:3]; title="Year effects")
    ax_tech = Axis(fig[3,1:3]; title="Technology effects")
    ax_juds = Axis(fig[4,1:3]; title="Judge effects")

    Model.errorplot!(ax_pred, Prediction(), model, samples; sort=true); hlines!(ax_pred, empirical_mean; color = :red)
    Model.errorplot!(ax_time, Effect(), samples.zt .* samples.σt; sigma=1)
    Model.errorplot!(ax_tech, Effect(), samples.zc .* samples.σc; sigma=1)
    Model.errorplot!(ax_juds, Effect(), samples.zj .* samples.σj; sigma=1)

    axt = Axis(fig[5,1]; title="Year effect s.d.")
    axc = Axis(fig[5,2]; title="Technology effect s.d.")
    axj = Axis(fig[5,3]; title="Judge effect s.d.")

    density!(axt, samples.σt; color=:grey80)
    density!(axc, samples.σc; color=:grey80)
    density!(axj, samples.σj; color=:grey80)

    fig
end

plot_mixed_membership_model(model, samples, decisions)

In a mixed membership model controlling for time and patent technology, judge effects persist. Using the standard deviation of the varying judge effects as a test quantity, we would reject the null hypothesis corresponding to the legalist model of decision making (\(H_l\)) in favour of the realist alternative (\(H_r\)).
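A plausible form of the mixed membership model, inferred from the effect names (`zt`, `zc`, `zj`) and scales (`σt`, `σc`, `σj`) appearing in the plotting code (the exact specification lives in the unshown `Model` module, so this is a sketch rather than the author's definitive model):

```latex
\Pr(\text{annulled}_i = 1)
  = \operatorname{logistic}\!\Big(
      \alpha
      + \sigma_{\text{time}}\, z^{\text{time}}_{t(i)}
      + \sigma_{\text{tech}}\, z^{\text{tech}}_{c(i)}
      + \frac{\sigma_{\text{jud}}}{\lvert J_i \rvert} \sum_{j \in J_i} z^{\text{jud}}_{j}
    \Big),
\qquad z \sim \mathcal{N}(0, 1),
```

where \(t(i)\) is the decision year, \(c(i)\) the technology class, and \(J_i\) the panel of judges deciding case \(i\). The ‘mixed membership’ refers to each case belonging fractionally to the several judges on its panel, and the scales \(\sigma_{\text{time}}, \sigma_{\text{tech}}, \sigma_{\text{jud}}\) would carry weakly informative (e.g. half-normal) priors.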